Rationale for this change

Parquet files often contain columns with highly repetitive values (e.g., status codes, categories, constant metadata fields). Currently, Arrow reads these into dense arrays, materializing every value and consuming significant memory.
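For intuition, run-end encoding stores each distinct run once, together with the index at which that run ends. A minimal pure-Python sketch of the transformation (illustrative only, not Arrow's implementation):

```python
def run_end_encode(values):
    """Collapse a dense sequence into (run_values, run_ends) lists,
    mirroring the shape of Arrow's run-end-encoded layout."""
    run_values, run_ends = [], []
    for i, v in enumerate(values):
        if not run_values or v != run_values[-1]:
            # A new run starts here; its end index is provisionally i + 1.
            run_values.append(v)
            run_ends.append(i + 1)
        else:
            # Extend the current run by pushing its end index forward.
            run_ends[-1] = i + 1
    return run_values, run_ends

# A column of 1000 identical statuses collapses to a single run.
vals, ends = run_end_encode(["ok"] * 1000 + ["error"] * 2)
# vals == ["ok", "error"], ends == [1000, 1002]
```

The memory saving in the highly-repetitive case is exactly this collapse: 1002 dense values become two runs.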
This PR implements direct reads of Parquet RLE (Run Length Encoded) data into Arrow REE (Run End Encoded) representation as described in #32339, using the existing read_dictionary API as inspiration for interface level changes. Like read_dictionary, this feature is currently only supported for columns with a Parquet physical type of BYTE_ARRAY, such as string or binary types.
Example usage:
```python
import pyarrow.parquet as pq

# These columns will be read directly as Arrow run-end-encoded arrays,
# without full materialization, if their Parquet representation was
# run-length-encoded.
table = pq.read_table('data.parquet', read_ree=['category', 'status'])
```
This is a considerably hefty feature, so please let me know if there's anything I can do to help the review process (e.g., by splitting this into multiple smaller PRs). The way I implemented this is by adding a GetNextValueAndNumRepeats API to the RleBitPackedDecoder, which skips materialization of all values for RleRuns while still stepping through BitPackedRuns bit by bit. I'm definitely open to suggestions on other approaches here.
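To illustrate the idea behind that API (a hypothetical pure-Python sketch, not the actual C++ decoder interface): instead of expanding each RLE run into `count` copies of a value, the decoder can yield `(value, num_repeats)` pairs, while bit-packed runs are still walked one value at a time:

```python
def next_value_and_num_repeats(runs):
    """Yield (value, repeat_count) pairs from a mixed RLE/bit-packed stream.

    `runs` is a list of ('rle', value, count) or ('bitpacked', [values])
    tuples -- a simplified stand-in for Parquet's hybrid RLE/bit-packing
    encoding, not the real decoder input.
    """
    for run in runs:
        if run[0] == 'rle':
            # An RLE run maps to a single pair: no materialization needed.
            _, value, count = run
            yield value, count
        else:
            # Bit-packed runs must still be decoded value by value.
            for value in run[1]:
                yield value, 1

pairs = list(next_value_and_num_repeats(
    [('rle', 7, 1000), ('bitpacked', [1, 2, 3])]))
# pairs == [(7, 1000), (1, 1), (2, 1), (3, 1)]
```

A run of 1000 repeated values costs one pair rather than 1000 materialized elements, which is where the speedup on repetitive columns comes from.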
Regarding performance, I am anecdotally observing an order of magnitude (~10x) speedup when reading columns with many repeated values, and a slight performance degradation for columns containing purely unique values when the feature is enabled. (Since the feature is toggled by the user, they would presumably understand the shape of their data well enough to decide whether to enable it.) I haven't done any scientific benchmarking beyond this; let me know if that would be helpful.
What changes are included in this PR?
Add ArrowReaderProperties::set_read_ree() / read_ree() methods to enable REE reading per-column
Implement GetNextValueAndNumRepeats() and GetNextValueAndNumRepeatsSpaced() methods in RleBitPackedDecoder
Add ByteArrayReeRecordReader for decoding Parquet BYTE_ARRAY columns to RunEndEncodedArray
Support REE decoding for both Plain and RLE_DICTIONARY encodings
Add Python bindings via read_ree_columns parameter in ParquetDataset and ParquetFile
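As a rough illustration of why the RLE_DICTIONARY encoding pairs naturally with REE (a simplified pure-Python model, not the actual decoder): dictionary-encoded pages store RLE runs of dictionary indices, so each run of indices can be translated into a single REE run of the looked-up value:

```python
def decode_rle_dictionary_to_ree(dictionary, index_runs):
    """Translate RLE runs of dictionary indices straight into REE runs.

    `index_runs` is a list of (index, count) pairs -- a simplified model
    of Parquet RLE_DICTIONARY page data, not the real on-disk format.
    """
    run_values, run_ends = [], []
    end = 0
    for index, count in index_runs:
        # Each RLE run of indices becomes exactly one REE run.
        end += count
        run_values.append(dictionary[index])
        run_ends.append(end)
    return run_values, run_ends

vals, ends = decode_rle_dictionary_to_ree(
    ["active", "inactive"], [(0, 500), (1, 3)])
# vals == ["active", "inactive"], ends == [500, 503]
```

Plain-encoded pages lack this run structure, which is consistent with the observation above that purely unique columns see no benefit.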
lesterfan changed the title to "GH-32339: [C++][Python][Parquet] Implement direct reads of Parquet RLE encoded data into Arrow REE" on Apr 10, 2025.
Tagging @pitrou and @raulcd for some initial feedback on the implementation. I'm happy to split this into smaller PRs if that's easier for review (though guidance on how to split it would be appreciated), or make design changes based on your input. Tagging you both since we've worked together previously and I see you've reviewed recent REE changes, but feel free to suggest other reviewers if more appropriate.
@pitrou I rebased this PR on current main. Let me know if there's anything else I can do to help review here.
I also tested this again locally with a Parquet file with many repeated values and am anecdotally observing a ~10x speedup in reads using this code path.
Are these changes tested?
Yes, through included C++ unit tests and pytests.
Are there any user-facing changes?
Yes.